ESTIMATION OF LOW-RANK TENSORS VIA CONVEX 
OPTIMIZATION* 
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Abstract. In this paper, we propose three approaches for the estimation of the Tucker decom- 
position of multi-way arrays (tensors) from partial observations. All approaches are formulated as 
convex minimization problems. Therefore, the minimum is guaranteed to be unique. The proposed 
approaches can automatically estimate the number of factors (rank) through the optimization. Thus, 
there is no need to specify the rank beforehand. The key technique we employ is the trace norm 
regularization, which is a popular approach for the estimation of low-rank matrices. In addition, 
we propose a simple heuristic to improve the interpretability of the obtained factorization. The 
advantages and disadvantages of three proposed approaches are demonstrated through numerical 
experiments on both synthetic and real world datasets. We show that the proposed convex opti- 
mization based approaches are more accurate in predictive performance, faster, and more reliable in 
recovering a known multilinear structure than conventional approaches. 

1. Introduction. Multi-way data analysis have recently become increasingly 
popular supported by modern computational power [23, 37]. Originally developed in 
the field of psychometrics and chemometrics, its applications can now also be found 
in signal processing (for example, for independent component analysis) [14], neuro- 
science [29], and data mining [28]. Decomposition of multi-way arrays (or tensors) into 
small number of factors have been one of the main concerns in multi-way data analysis, 
because interpreting the original multi-way data is often impossible. There are two 
popular models for tensor decomposition, namely the Tucker decomposition [44, 13] 
and the C ANDECOMP /PARAFAC (CP) decomposition [11, 19]. In both cases, con- 
ventionally the estimation procedures have been formulated as non-convex optimiza- 
tion problems, which are in general only guaranteed to converge locally and could 
potentially suffer from poor local minima. Moreover, a popular approach for Tucker 
decomposition known as the higher order orthogonal iteration (HOOI) may converge 
to a stationary point that is not even a local minimizer [15]. 

Recently, convex formulations for the estimation of low-rank matrix, which is a 
special case of tensor, have been intensively studied. After the pioneering work of Fazel 
et al. [16] , convex optimization has been used for collaborative filtering [38] , multi-task 
learning [3], and classification over matrices [40]. In addition, there are theoretical 
developments that (under some conditions) guarantee perfect reconstruction of a low- 
rank matrix from partial measurements via convex estimation [10, 32]. The key idea 
here is to replace the rank of a matrix (a non-convex function) by the so-called trace 
norm (also known as the nuclear norm) of the matrix. One goal of this paper is 
to extend the trace-norm regularization for more than two dimensions. There have 
recently been related work by Liu et al. [27] and Signorctto ct al. [36] , which correspond 
to one of the proposed approaches in the current paper. 

In this paper, we propose three formulations for the estimation of low rank tensors. 
The first approach is called "as a matrix" and estimates the low-rank matrix that is 
obtained by unfolding (or matricizing) the tensor to be estimated; thus this approach 
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basically treats the unknown tensor as a matrix and only works if the tensor is low- 
rank in the mode used for the estimation. The second approach called "constraint" 
extends the first approach by incorporating the trace norm penalties with respect to 
all modes simultaneously. Therefore, there is no arbitrariness in choosing a single 
mode to work with. However, all modes being simultaneously low-rank might be a 
strong assumption. The third approach called "mixture" relaxes the assumption by 
using a mixture of K tensors, where K is the number of modes of the tensor. Each 
tensor is regularized to be low-rank in each mode. 

We apply the above three approaches to the reconstruction of partially observed 
tensors. In both synthetic and real-world datasets, we show the superior predictive 
performance of the proposed approaches against conventional expectation maximiza- 
tion (EM) based estimation of Tucker decomposition model. We also demonstrate the 
effectiveness of a heuristic to improve the interpretability of the core tensor obtained 
by the proposed approaches on the amino acid fluorescence dataset. 

This paper is structured as follows. In the next section, we first review the 
matrix rank and its relation to the trace norm. Then we review the definition of 
tensor mode-fc rank, which suggests that a low rank tensor is a low rank matrix 
when appropriately unfolded. In Section 3, we propose three approaches to extend 
the trace-norm regularization for the estimation of low-rank tensors. In Section 4, 
we show that the optimization problems associated to the proposed extensions can 
be solved efficiently by the alternating direction method of multipliers [17]. In Sec- 
tion 5, we show through numerical experiments that one of the proposed approaches 
can recover a partly observed low-rank tensor almost perfectly from smaller fraction 
of observations compared to the conventional EM-based Tucker decomposition algo- 
rithm. The proposed algorithm shows a sharp threshold behaviour from a poor fit to 
a nearly perfect fit; we numerically show that the fraction of samples at the thresh- 
old is roughly proportional to the sum of the fc-ranks of the underlying tensor when 
the tensor dimension is fixed. Finally we summarize the paper in Section 6. Earlier 
version of this manuscript appeared in NIPS2010 workshop "Tensors, Kernels, and 
Machine Learning" . 

2. Low rank matrix and tensor. In this section, we first discuss the connec- 
tion between the rank of a matrix and the trace-norm regularization. Then we review 
the CP and the Tucker decomposition and the notions of tensor rank connected to 
them. 

2.1. Rank of a matrix and the trace norm. The rank r of an R x C matrix 
X can be defined as the number of nonzero singular values of X. Here, the singular- 
value decomposition (SVD) of X is written as follows: 

X = C/diag( ( 7 1 (X), ( 7 2 (X),...,a r (X))V T , (2.1) 

where U G M. Rxr and V G R Cxr are orthogonal matrices, and <Jj(X) is the jth 
largest singular- value of X. The matrix X is called low-rank if the rank r is less 
than min(_R, C). Unfortunately, the rank of a matrix is a nonconvex function, and 
the direct minimization of rank or solving a rank-constrained problem is an NP-hard 
problem [32]. 

The trace norm is known to be the tightest convex lower bound of matrix rank [32] 
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(see Fig. 2.1) and is denned as the linear sum of singular values as follows: 



||X|U = ]>>,(X). 

Intuitively, the trace norm plays the role of the ^i-norm in the subset selection prob- 
lem [39] , for the estimation of low-rank matrix 1 . The convexity of the above function 
follows from the fact that it is the dual norm of the spectral norm || • || (see [6, Section 
A. 1.6]). Since it is a norm, the trace norm || • ||* is a convex function. The non- 
diffcrcntiability of the trace norm at the origin promotes many singular values of X 
to be zero when used as a regularization term. In fact, the following minimization 
problem has an analytic solution known as the spectral soft-thresholding operator (see 
[8]): 

argmin \\\X - Y\\ 2 Fm + X\\X\l, (2.2) 
x * 

where || • \\-p ro is the Frobenius norm, and A > is a regularization constant. The 
spectral soft-thresholding operation can be considered as a shrinkage operation on 
the singular values and is defined as follows: 

prox^(Y) = Umax(S- X,0)V T , (2.3) 

where Y = USV T is the SVD of the input matrix Y, and the max operation is taken 
element-wise. We can see that the spectral soft-thresholding operation truncates the 
singular- values of the input matrix Y smaller than A to zero, thus the resulting matrix 
X is usually low-rank. See also [42] for the derivation. 

For the recovery of partially observed low-rank matrix, some theoretical guaran- 
tees have recently been developed. Candes and Recht [10] showed that in the noiseless 
case, 0(n 6 / 5 r log(n)) samples are enough to perfectly recover the matrix under uni- 
form sampling if the rank r is not too large, where n = max(i?, C). 

2.2. Rank of a tensor. For higher order tensors, there are several definitions of 
rank. Let X G Rmxn 2 x-xn K be a ^-way tensor. The rank of a tensor (sec [24]) is 
defined as the minimum number r of components required for rank-one decomposition 
of a given tensor X in analogy to SVD as follows: 

X = j^X ja fKaf o.-.o af>, 

= Ax 1 A (1) x 2 A (2) -.-x Jf 4 w (2.4) 

where o denotes the outer product, A 6 R rx "" xr denotes a if- way diagonal matrix 
whose (j,j,j)th element is Xj, and x& denotes the fc-mode matrix product (see Kolda 

& Bader [23]); in addition, we define = [a^ , . . . , a^]. The above decomposition 
model is called CANDECOMP [11] or PARAFAC [19]. It is worth noticing that 
finding the above decomposition with the minimum r is a hard problem; thus there 
is no straightforward algorithm for computing the rank for higher-order tensors [24] . 

1 Note however that the absolute value is not taken here because singular value is defined to be 
positive. 
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Fig. 2.1. Penalty functions \x\ p over one singular value x are schematically illustrated for 
various p. The absolute penalty function \x\ is the tightest convex lower bound of the rank (p — I 0) 
in the interval [—1,1]. 

We consider instead the mode-fc rank of tensors, which is the foundation of the 
Tucker decomposition [44, 13]. The mode-fc rank of X, denoted rankfc(Af), is the 
dimensionality of the space spanned by the mode-fc fibers of X. In other words, the 
mode-fc rank of X is the rank of the mode-fc unfolding -X"(fe) of X. The mode-fc 
unfolding X(f.) is the rik x n\ k (n\ k := Yik'^k n k') matrix obtained by concatenating 
the mode-fc fibers of X as column vectors. In MATLAB this can be obtained as 
follows: 

X=permute(X, [k:K, 1 :k-l] ) ; X=X(: , :) ; 

where the order of dimensions other than the first dimension fc is not important as long 
as we use a consistent definition. We say that a if- way tensor X is rank-(ri, . . . , tk) 
if the mode-fc rank of X is r% (k = 1, . . . , K). Unlike the rank of the tensor, mode-k 
rank is clearly computable; the computation of the mode-fc ranks of a tensor boils 
down to the computation the rank of K matrices. 

A rank-(ri, . . .,tk) tensor X G flj»ix— xra.se can be written as 

X = Q x 1 U 1 x 2 U 2 ---x K U K , (2.5) 

where Q e R nx '" x, " K is called a core tensor, and Uk e R n *x r * ( n = 1,...,K) 
are left singular-vectors from the SVD of the mode-fc unfolding of X. The above 
decomposition is called the Tucker decomposition [44, 23]. 

The definition of a low-rank tensor (in the sense of Tucker decomposition) implies 
that a low-rank tensor is a low-rank matrix when unfolded appropriately. In order to 
see this, we recall that for the Tucker model (2.5), the mode-fc unfolding of X can be 
written as follows (see e.g., [23]): 

X [k) = U k G (k) (U k -i ® • • • ® Ui <g> U K ® • • • (g> U k +i) T . 

Therefore, if the tensor X is low-rank in the fcth mode (i.e., < min(rifc, n\k)), 
its unfolding is a low-rank matrix. Conversely, if -X'(fe) is a low-rank matrix (i.e., 
Xtf.) = USV T ), we can set Uk = U, Gha = SV T , and other Tucker factors Uk' 
(fc' 7^ fc) as identity matrices, and we obtain a Tucker decomposition (2.5). 
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Note that if a given tensor X can be written in the form of CP decomposition (2.4) 
with rank r, we can always find a rank-(r, r, . . . , r) Tucker decomposition (2.5) by 
ortho-normalizing each factor in Equation (2.4). Therefore, the Tucker decom- 
position is more general than the CP decomposition. 

However, since the core tensor Q that corresponds to singular-values in the ma- 
trix case (see Equation (2.1)) is not diagonal in general, it is not straightforward to 
generalize the trace norm from matrices to tensors. 

3. Three strategies to extend the trace-norm regularization to tensors. 

In this section, we first consider a given tensor as a matrix by unfolding it at a given 
mode k and propose to minimize the trace norm of the unfolding X( k y Next, we 
extend this to the minimization of the weighted sum of the trace norms of the unfold- 
ings. Finally, relaxing the condition that the tensor is jointly low-rank in every mode 
in the second approach, we propose a mixture approach. For solving the optimization 
problems, we use the alternating direction method of multipliers (ADMM) [17] (also 
known as the split Brcgman iteration [18]). The optimization algorithms are discussed 
in Section 4. 

3.1. Tensor as a matrix. If we assume that the tensor we wish to estimate is 
(at least) low-rank in the fcth mode, we can convert the tensor estimation problem 
into a matrix estimation problem. Extending the minimization problem (2.2) to 
accommodate missing entries we have the following optimization problem for the 
reconstruction of partially observed tensor: 

. m „ ini x miz x e ^M x ) -y\\ 2 + ll x «H*' ( 3J ) 

AfeR"i x "" x "^ 2A 

where -X'(fe) is the mode-fc unfolding of X , y e K M is the vector of observations, and 
: R™ix - xn K _^ j^Af - g a i mear operator that reshapes the prespecified elements 
of the input tensor into an M dimensional vector; M is the number of observations. 
In Equation (3.1), the regularization constant A > is moved to the denominator 
of the loss term from the numerator of the regularization term in Equation (2.2); 
this equivalent reformulation allows us to consider the noiseless case (A — > 0) in the 
same framework. Note that A can also be interpreted as the variance of the Gaussian 
observation noise model. 

Since the estimation procedure (3.1) is essentially an estimation of a low-rank 
matrix X( fe ), we know that in the noiseless case 0(n^/ 5 rk \og(fik)) samples are enough 
to perfectly recover the unknown true tensor X* , where — rank^(Af*) and fik = 
max(rjfc, n\ fc ), if the rank is not too high [10]. This holds regardless of whether 
the unknown tensor X is low-rank in other modes fc' ^ k. Therefore, when we can 
estimate the mode-fc unfolding of X* perfectly, we can also recover the whole X* 
perfectly, including the ranks of the modes we did not use during the estimation. 

However, the success of the above procedure is conditioned on the choice of the 
mode to unfold the tensor. If we choose a mode with a large rank, even if there are 
other modes with smaller ranks, we cannot hope to recover the tensor from a small 
number of samples. 

Various advanced methods [42, 43, 25, 22] for the estimation of low-rank matrices 
can be used for solving the minimization problem (3.1). Here we use ADMM to keep 
the presentation concise; see Section 4 for the details. 

3.2. Constrained optimization of low rank tensors. In order to exploit 
the rank deficiency of more than one mode, it is natural to consider the following 
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extension of the estimation procedure (3.1) 



minimize ^r\MX) - yf + V lk \\X {k) ||*. (3.2) 

/c— 1 

This is a convex optimization problem, because it can be reformulated as follows: 

1 K 
minimize — ||Oas - y\\ 2 + Vlfcll-^H*, (3.3) 

fc— 1 

subject to -Pfccc = 2;*; (fc = 1, . . . , if), (3.4) 

where x £ is the vectorization of AT (N = Yi k =i nfc )' -''fc * s * ne ma t r i x representa- 
tion of mode-fc unfolding (note that P k is a permutation matrix; thus Pk T Pk = In), 
Z k £ K™fc xn \fc i s an auxiliary matrix of the same size as the mode-A: unfolding of X, 
and Zk is the vectorization of Z k . With a slight abuse of notation ft £ R MxN denotes 
the observation operator as a matrix. 

This approach was considered earlier by Liu et al. [27] and Signoretto et al. [36]. 
Liu et al. relaxed the constraints (3.4) into penalty terms, therefore the factors ob- 
tained as the left singular vectors of Z k docs not equal the factors of the Tucker 
decomposition of X. Signoretto et al. have discussed the general Shatten-{jj, q} 
norms for tensors and the relationship between the regularization term in Equa- 
tion (3.2) with 7^ = 1/K (which corresponds to Shatten-{1, 1} norm) and the function 

-kT,k=i Tank k( x )- 

3.3. Mixture of low-rank tensors. The optimization problem (3.3) penalizes 
every mode of the tensor X to be jointly low-rank, which might be too strict to be 
satisfied in practice. Thus we propose to predict instead with a mixture of K tensors; 
each mixture component is regularized by the trace norm to be low-rank in each mode. 
More specifically, we solve the following minimization problem: 

2 K 

+ ^2lk\\Z k \U. (3.5) 

k=l 

Note that when z k = -^P k x for all k = 1, . . . , K, the problem (3.5) reduces to the 
problem (3.3) with Y k = "f k /K. 

3.4. Interpretation. All three proposed approaches inherit the lack of unique- 
ness of the factors from the conventional Tucker decomposition [23] . Some heuristics 
to improve the interpretability of the core tensor Q are proposed and implemented 
in the TV- way toolbox [2]. However, these approaches are all restricted to orthog- 
onal transformations. Here we present another simple heuristic, which is to apply 
PARAFAC decomposition on the core tensor Q. This approach has the following 
advantages over applying PARAFAC directly to the original data. First, the dimen- 
sionality of the core tensor (r\, . . . , tk) is automatically obtained from the proposed 
algorithms. Therefore, the range of the number of PARAFAC components that we 
need to look for is much narrower than applying PARAFAC directly to the original 
data. Second, the PARAFAC problem does not need to take care of missing en- 
tries. In other words, we can separate the prediction problem and the interpretation 
problem, which are separately tackled by the proposed algorithms and PARAFAC, 
respectively. Finally, empirically the proposed heuristic seems to be more robust in 
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minimize 

Zi,...,Z K 



1 

2A 



Zk 



recovering the underlying factors compared to applying PARAFAC directory when 
the rank is misspecified (see Section 5.2). 

More precisely, let us consider the second "Constraint" approach. Let U\ . . . , U k 
be the left singular vectors of the auxiliary variables Zi,..., Zk- From Equation (2.5) 
we can obtain the core tensor Q as follows: 

g = xx 1 u 1 T x 2 u 2 T ■■■ x K u K T . 

Let . . . , be the factors obtained by the PARAFAC decomposition of Q as 
follows: 

e = A Xl A« x 2 A^---x K A^ K \ 
Therefore, we have the following decomposition 

X = A Xl (E/iA«) x 2 (U 2 A^) ...x K (U k AW), (3.6) 

which gives the kth. factor as UkA^ k \ 

4. Optimization. In this section, we describe the optimization algorithms based 
on the alternating direction method of multipliers (ADMM) for the problems (3.1), 
(3.3), and (3.5). 

4.1. ADMM. The alternating direction method of multipliers [17] (see also [5]) 
can be considered as an approximation of the method of multipliers [31, 21] (see also 
[4, 30]). The method of multipliers generates a sequence of primal variables (x*,z*) 
and multipliers a* by iteratively minimizing the so called augmented Lagrangian (AL) 
function with respect to the primal variables (x*,z*) and updating the multiplier 
vector a*. Let us consider the following linear equality constrained minimization 
problem: 

minimize f(x) + q(z), (4.1) 

subject to Ax — z, (4.2) 

where / and g are both convex functions. The AL function L v (x, z, a) of the above 
minimization problem is written as follows: 

L„(sb, z, a) = f(x) + g(z) + a 1 {Ax - z ) + l\\Ax- z\\ 2 , 

where a. 6 M m is the Lagrangian multiplier vector. Note that when 77 = 0, the 
AL function reduces to the ordinary Lagrangian function. Intuitively, the additional 
penalty term enforces the equality constraint to be satisfied. However, different from 
the penalty method (which was used in [27]), there is no need to increase the penalty 
parameter rj very large, which usually makes the problem poorly conditioned. 

The original method of multipliers performs minimization of the AL function with 
respect to x and z jointly followed by a multiplier update as follows: 

{x t+1 ,z t+1 ) = argmin ^(jc.z.a*), (4.3) 

x£R",zeR m 

a t+1 = a t + r ) {Ax t+1 - z t+1 ). (4.4) 

Intuitively speaking, the multiplier is updated proportionally to the violation of the 
equality constraint (4.2). In this sense, rj can also be regarded as a step-size parameter. 
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Under fairly mild conditions, the above method converges super-linearly to a solution 
of the minimization problem (4.1); sec [34, 41]. However, the joint minimization of 
the AL function (4.3) is often hard (see [41] for an exception). 

The ADMM decouples the minimization with respect to x and z as follows: 

x t+1 — argminL^x, z* , a*), (4.5) 

z t+1 = &iguimL 71 {x t+1 ,z,OL t ), (4.6) 

a t+1 = a f +r](Ax t+1 -z t+1 ). (4.7) 

Note that the new value of x t+1 obtained in the first line is used in the update 
of z t+1 in the second line. The multiplier update step is identical to that of the 
ordinary method of multipliers (4.4). It can be shown that the above algorithm is an 
application of firmly nonexpansive mapping and that it converges to a solution of the 
original problem (4.1). Surprisingly, this is true for any positive penalty parameter 
T) [26] . This is in contrast to the fact that a related approach called forward-backward 
splitting [26] (which was used in [36]) converges only when the step-size parameter r\ 
is chosen appropriately. 

4.2. Stopping criterion. As a stopping criterion for terminating the above 
ADMM algorithm, we employ the relative duality gap criterion; that is, we stop 
the algorithm when the current primal objective value p(x,z) := f(x) + g(z) and 
the largest dual objective value max t / = i i ... ;t d(a* ) obtained in the past satisfies the 
following equality 

(p(x*,z*) - max d(a*'))/p(a;S;8*) < e. (4.8) 

Note that the multiplier vector a* computed in Equation (4.7) cannot be directly 
used in the computation of the dual objective value, because typically a 1 violates the 
dual constraints. See Appendix A for the details. 

The reason we use the duality gap is that the criterion is invariant to the scale of 
the observed entries y and the size of the problem N . 

4.3. ADMM for the "As a Matrix" approach. We consider the following 
constrained reformulation of problem (3.1) 

minimize -^-||fia; — y\\ 2 + \\Z\\*, subject to PkX = z, (4.9) 

where x e R N is a vectorization of AT, Z 6 K"'""^ is an auxiliary variable that 
corresponds to the mode-fc unfolding of X, and z G R N is the vectorization of Z. 
The AL function of the above constrained minimization problem can be written as 
follows: 

L v (x, Z, a) = ^\\^x -y\\ 2 + ||Z||* + rja T {P k x - z) + ^\\P k x - z\\ 2 , (4.10) 

where a e R N is the Lagrangian multiplier vector that corresponds to the constraint 
PkX = z. Note that we rescaled the Lagrangian multiplier vector a by the factor rj 
for the sake of notational simplicity. 
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Starting from an initial point (x°,Z°,a ), we apply the ADMM explained in 
the previous section to the AL function (4.10). All the steps (4.5)-(4.7) can be 
implemented in closed forms. First, minimization with respect to x yields, 



x 



' ' - ( n T y + Ar ? P fc T (z t - a 1 )) ./(l n + \r)l N ), (4.11) 



where In is an iV-dimensional vector that has one for observed elements and zero 
otherwise; ljv is an iV-dimensional vector filled with ones; ./ denotes element-wise 
division. Note that when A — > (no observational noise), the above expression can 
be simplified as follows: 



pt+1 _ ){Sl T y)u iefi, 
(P fe T (z* - a*))i, t^n 



^^rnTM (<=i,...,^. (4.i2) 



Here the observed entries of x are overwritten by the observed values y and the unob- 
served entries are filled with the modc-fc tensorization of the current prediction z* — a*. 
In the general case (4.11), the predicted values also affect the observed entries. The 
primal variable x l and the auxiliary variable z* becomes closer and closer as the opti- 
mization proceeds. This means that eventually the multiplier vector a 1 takes non-zero 
values only on the observed entries when A — > 0. 

Next, the minimization with respect to Z yields, 

Z t+1 =prc< /T) (P fcaJ t + 1 +<*«), 

where prox*y is the spectral soft-threshold operation (2.3) in which the argument 
P k x t+1 + ol 1 is considered as a n k x h\ k matrix. 

The last step is the multiplier update (4.7), which can be written as follows: 



- ~* + (P k x t+1 - z t+1 ) . (4.13) 



a. = a 



Note that the step-size parameter r] does not appear in (4.13) due to the rescaling of 
a in (4.10). 

The speed of convergence of the algorithm mildly depends on the choice of the 
step-size r\. Here as a guideline to choose 77, we require that the algorithm is invariant 
to scalar multiplication of the objective (4.9). More precisely, when the input y and 
the regularization constant A are both multiplied by a constant c, the solution of 
the minimization (4.9) (or (3.1)) should remain essentially the same as the original 
problem, except that the solution x is also multiplied by the constant c. In order to 
make the algorithm (see (4.11)-(4.13)) follow the same path (except that x*, z*, and 
a* are all multiplied by c), we need to scale 77 inversely proportional to c. We can also 
see this in the AL function (4.10); in fact, the first two terms scale linearly to c, and 
also the last two terms scale linearly if i] scales inversely to c. Therefore we choose r] 
as r] — ?7o/std(y), where 770 is a constant and std(y) is the standard deviation of the 
observed values y. 

4.4. ADMM for the "Constraint" approach. The AL function of the con- 
strained minimization problem (3.3)- (3.4) can be written as follows: 

1 K 

l v (x, {z k }ti, Mti) = ^lin* - y\\ 2 + E -Yk\\z k \U 

k=l 

K 



+ 2 (w*k T (PkX - z k ) + ^\\P k x - z fc || 2 ) . 



fc=i 
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Note that we rescaled the multiplier vector a by the factor rj as in the previous 
subsection. 

Starting from an initial point (x°, {Z k } k=1 , {a k } k=1 ), we take similar steps as in 
(4.11)-(4.13) except that the last two steps are performed for all k = 1, . . . , K . That 
is, 

x t+1 = (Vy + Ary^T^Pfe^-c^ ./{l a + Xr,Kl N ), (4.14) 

Zl +1 = prox^ (P k x t+1 + a* k ) (k = l,...,K), (4.15) 

«' fe +1 = <4 + (PkX t+1 - zl +1 ) (k = l,...,K). (4.16) 

By considering the scale invariance of the algorithm, we choose the step-size 77 as 
rj = r] /std(y) as in the previous subsection. 

4.5. ADMM for the "Mixture" approach. We consider the following dual 
problem of the mixture formulation (3.5): 

A K 
minimize - ||a|| 2 - a T y + V 5 lk (W k ), (4.17) 

subject to Wk = Pkfl T (x (k = 1,...,K), 

where a e M. M is a dual vector; W k e R n " xn \k is an auxiliary variable that corre- 
sponds to the mode-A; unfolding of fi T a, and w k e R N is the vectorization of W k ; 
the indicator function 5\ is defined as S\(W) = 0, if ||W|| < A, and 6\(W) = +00, 
otherwise, where || • || is the spectral norm (maximum singular-value of a matrix). 
The AL function for the problem (4.17) can be written as follows: 

L„(a, {W k }« =1 , {z fe }f =1 ) = ^||a|| 2 - « T y + £ 5 lk (W k ) 

1 fe=i 

K 



+ (z k T (P k n T a - w k ) + ^\\P k n T a - w k \\ 2 ) 



k=l 

Similar to the previous two algorithms, we start from an initial point (a. , {W° k } k=l , {z k } k=1 ), 
and compute the following steps: 

a t+1 = argmin^(a,{^}f =1 ,{z* fe }f =1 ) 

a 

W{ +1 = argmini,(a'+\{W fe }f =1 ,{z<}f=i) 
w k 

4 +1 = 4 + v(P k n T cx t+1 w\ +1 ). (4.18) 
The above steps can be computed in closed forms. In fact, 

Wl +1 = proj 7fc (P k n T a t+1 + zl/ V ), (4.20) 

where the projection operator proj A is the projection onto a radius A-spectral-norm 
ball, as follows: 

proj A (u;) := J7min(5, A)V T , 
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where W = USV T is the SVD of the matricization of the input vector w. Moreover, 
combining the two steps (4.20) and (4.18), we have (see [41]) 

4+ 1 = prax£„ (4 + ^0 T a i+1 ) . (4.21) 

Note that we recover the spectral soft-threshold operation prox^f by combining the 
two steps. Therefore, we can simply iterate steps (4.19) and (4.21) (note that the 
term r\w\ in (4.19) can be computed from (4.18) as r\w k — z l k x + r)P k £l T a* — zt.) 

In order to see that the multiplier vector z\ obtained in the above steps converges 
to the primal solution of the mixture formulation (3.5), we take the derivative of the 
ordinary Lagrangian function Lq with respect to a. and W k (ft = 1, ... , K) and obtain 
the following optimality conditions: 

P k Sl T cx E d-Y k \\Z k \U (k = l,...,K), 

where we used the relationship w k = P k ^i T ot, and the fact that d8 lk (W k ) 3 z k 
implies w k E (?7fc||.Zfc||* because the two functions 8 lk and 7fc|| • ||* are conjugate to 
each other; see [33, Cor. 23.5.1]. By combining the above two equations, we obtain 
the optimality condition for the mixture formulation (3.5) as follows: 

-jP k n T (y-n^ =i P k T z^j +dj k \\z k \\* go (k = i,...,K). 

As in the previous two subsections, we require that the algorithm (4.18)-(4.21) 
is invariant to scalar multiplication of the input y and the regularization constant A 
by the same constant c. Since z\ appears in the final solution, z\ must scale linearly 
with respect to c. Thus from (4.18), if a k and w\ are constants with respect to c, the 
step-size 77 must scale linearly. In fact, from (4.19) and (4.20), we can see that these 
two dual variables remain constant when y, z k , and 77 are multiplied by c. Therefore, 
we choose 77 = std(y)/?7o. 

5. Numerical experiments. In this section, we first present results on two 
synthetic datasets. Finally we apply the proposed methods to the Amino acid fluo- 
rescence data published by Bro and Andersson [7] . 

5.1. Synthetic experiments. We randomly generated a rank-(7,8,9) tensor of 
dimensions (50,50,20) by drawing the core from the standard normal distribution and 
multiplying its each mode by an orthonormal factor randomly drawn from the Haar 
measure. We randomly selected some elements of the true tensor for training and 
kept the remaining elements for testing. We used the algorithms described in the 
previous section with the tolerance e = 10~ 3 . We choose j k = 1 for simplicity in 
the later two approaches. The step-size ij is chosen as 77 = ?7o/std(y) for the first 
two approaches and 77 = std(y)/?7o for the third approach with 770 = 0.1. For the 
first two approaches, A — > (zero observation error) was used; see (4.12). For the 
last approach, we used A — 0. The Tucker decomposition algorithm tucker from the 
A^-way toolbox [2] is also included as a baseline, for which we used the correct rank 
( "exact" ) and the 20% higher rank ( "large" ) . Note that all proposed approaches can 
find the rank automatically. The generalization error is defined as follows: 

I V prod ~ V tost II 

error = — , 

Ill/tost II 



11 




Fig. 5.1. Comparison of three strategies, tensor as a matrix ("As a Matrix"), constrained 
optimization ("Constraint"), and mixture of low-rank tensors ("Mixture") on a synthetic rank- 
(7,8,9) tensor (the dimensions are 50 X 50 X 20). Also the Tucker decomposition with 20% higher 
rank ("large") and with the correct rank ("exact") implemented in the N-way toolbox [2] are included 
as baselines. The generalization error is plotted against the fraction of observed elements of the 
underlying low-rank tensor. Also the tolerance of optimization (W~ 3 ) is shown. 



As a Matrix 

Constraint 




Mixture 
Tucker (large) 
Tucker (exact) 



0.2 0.4 0.6 0.8 1 

Fraction of observed elements 



Fig. 5.2. Comparison of computation times. 



where y tcst is the vectorization of the unobserved entries and t/ prc( j is the prediction 
computed by the algorithms. For the "As a Matrix" strategy, error for each mode 
is reported. All algorithms were implemented in MATLAB and ran on a computer 
with two 3.5GHz Xeon processors and 32GB of RAM. The experiment was repeated 
20 times and averaged. 

Figure 5.1 shows the result of tensor completion using three strategies we pro- 
posed above, as well as the Tucker decomposition. At 35% observation, the proposed 
"Constraint" obtains nearly perfect generalization. Interestingly there is a sharp tran- 
sition from a poor fit (generalization error> 1) to an almost perfect fit (generalization 
error~ 10~ 3 ). The "As a Matrix" approach also show similar transition for mode 1 
and mode 2 (around 40%), and mode 3 (around 80%), but even the first transition is 
slower than the "Constraint" approach. The "Mixture" approach shows a transition 
around 70% slightly faster than the mode 3 in the "As A Matrix" approach. Tucker 
shows early decrease in the generalization error, but when the rank is misspecified 
("large"), the error remains almost constant; even when the correct rank is known 
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Fig. 5.3. Fraction of observations at the threshold plotted against the sum of true ranks. 
Numbers in the brackets denote the k-rank of the underlying tensor. The dimension of the tensor is 
(50,50,20). 



( "exact" ) , the convergence is slower than the proposed "Constraint" approach. 

The proposed convex approaches are not only accurate but also fast. Fig. 5.2 
shows the computation time of the proposed approaches and EM-based Tucker de- 
composition against the fraction of observed entries. For the "As a Matrix" approach 
the total time for all modes is plotted. We can see that the "As a Matrix" and "Con- 
straint" approaches are roughly 4-10 times faster than the conventional EM-based 
tucker decomposition. 

We have further investigated the condition for the threshold behaviour using the 
proposed "Constraint" approach. Here we generated different problems of different 
core dimensions (ri,r2,rs). The sum of mode-fc ranks is defined as min(ri, f2rs) + 
min(r2,r3Ti) + min(r3, rir-z)- For each problem, we apply the "Constraint" approach 
for increasingly large fraction of observations and determine when the generalization 
error falls below 0.01. Fig. 5.3 shows the fraction of observations required to obtain 
generalization error below 0.01 (in other words, the fraction at the threshold) against 
the sum of mode-fc ranks defined above. We can see that the fraction at the threshold 
is roughly proportional to the sum of the mode-fc ranks of the underlying tensor. We 
do not have any theoretical argument to support this observation. Acar et al [1] also 
empirically discussed condition for successful recovery for the CP decomposition. 

Figure 5.4 show another synthetic experiment. We randomly generated a rank- 
(50, 50, 5) tensor of the same dimensions as above. We chose the same parameter 
values 7^ = 1, A — > 0, e = 10~ 3 , and 770 = 0.1. Here we can see that interestingly the 
"Constraint" approach perform poorly, whereas the "mode 3" and "Mixture" perform 
clearly better than other algorithms. It is natural that the "mode 3" approach works 
well because the true tensor is only low-rank in the third mode. In contrast, the 
"Mixture" approach can automatically detect the rank-deficient mode, because the 
regularization term in the formulation (3.5) is a linear sum of three (K — 3) penalty 
terms. The linear sum structure enforces sparsity across Z^. Therefore, in this case 
Z\ and Zi were switched off, and "Mixture" approach yielded almost identical results 
to the "mode 3" approach. 

5.2. Amino acid fluorescence data. The amino acid fluorescence data is a 
semi-realistic data contributed by Bro and Andersson [7], in which they measured 
the fluorescence of five laboratory-made solutions that each contain different amounts 
of tyrosine, tryptophan and phenylalanine. Since the "factors" are known to be the 
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Fig. 5.4. Synthetic experiment on a rank-(50, 50, 5) tensor of dimensions 50 X 50 X 20. See also 
Fig. 5.1. 

three amino acids, this is a perfect data for testing whether the proposed method can 
automatically find those factors. 

For the experiments in this subsection, we chose the same parameter setting 
7fc = f, A — > 0, e = 10~ 3 , and 770 = O.f. Setting A —> corresponds to assuming- 
no observational noise. This can be justified by the fact that the original data is 
already approximately low-rank (rank-(3, 3, 3)) in the sense of Tucker decomposition. 
The dimensionality of the original tensor is 201 x 61 x 5, which correspond to emis- 
sion wavelength (250-450 nm), excitation wavelength (240-300 nm), and samples, 
respectively. 

Fig. 5.5 show the generalization error obtained by the proposed approaches as 
well as EM-based Tucker and PARAFAC decompositions. Here PARAFAC is included 
because the dataset is originally designed for PARAFAC. We can see that the proposed 
"Constraint" approach show fast decrease in generalization error, which is comparable 
to the PARAFAC model knowing the correct dimension. Tucker decomposition of 
rank-(3, 3, 3) performs as good as PARAFAC models when more than half the entries 
are observed. However, a slightly larger rank- (4, 4, 4) Tucker decomposition could not 
decrease the error below 0.05. 

Fig. 5.6 show the factors obtained by fitting directly three-component PARAFAC 
model, four-component PARAFAC model, and applying a four component PARAFAC 
model to the core obtained by the proposed "Constraint" approach. The fraction of 
observed entries was 0.5. The two conventional approaches used EM iteration for 
the estimation of missing values. For the proposed model, the dimensionality of the 
core was 4x4x5; this was obtained by keeping the singular- values of the auxiliary 
variable Zk that are larger than 1% of its largest singular- value for each k = 1, . . . , K. 
Then we applied a four-component (fully-observed) PARAFAC model to this core and 
obtained the factors as in Equation (3.6). Interestingly, although the four component- 
PARAFAC model is redundant for this problem [7], the proposed approach seem to 
be more robust than applying four-component PARAFAC directly to the data. We 
can see that the shape of the major three components (blue, green, red) obtained by 
the proposed approach (the right column) are more similar to the three-component 
PARAFAC model (the left column) than the four-component PARAFAC model (the 
center column). 

6. Summary. In this paper we have proposed three strategies to extend the 
framework of trace norm regularization to the estimation of partially observed low- 
rank tensors. The proposed approaches are formulated in convex optimization prob- 
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Fig. 5.5. Generalization performance of proposed methods on the amino acid fluorescence data 
is compared to conventional EM-based Tucker decomposition and PARAFAC. See also Figures 5.1 
and 5.4- 



PARAFAC(3) 



PARAFAC(4) 



Proposed(4) 




Fig. 5.6. Factors obtained by three- component PARAFAC (left), four- component PARAFAC 
(center), and the heuristic proposed in Section 3.4 (right) at the fraction of observation 0.5. Even 
when a redundant four- component PARAFAC is used in the post-processing, the proposed heuristic 
estimates the factors more reliably than directly applying the PARAFAC model. 



lems and the rank of the tensor decomposition is automatically determined through 
the optimization. 

In the simulated experiment, tensor completion using the "Constraint" approach 
showed nearly perfect reconstruction from only 35% observations. The proposed ap- 
proach shows a sharp threshold behaviour and we have empirically found that the 
fraction of samples at the threshold is roughly proportional to the sum of mode-A: 
ranks of the underlying tensor. 

We have also shown the weakness of the "Constraint" approach. When the un- 
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known tensor is only low-rank in certain mode, the assumption that the tensor is 
low-rank in every mode, which underlies the "Constraint" approach, is too strong. 
We have demonstrated that the "Mixture" approach is more effective in this case. 
The "Mixture" approach can automatically detect the rank-deficient mode and lead 
to better performance. 

In the amino acid fluorescence dataset, we have shown that the proposed "Con- 
straint" approach outperforms conventional EM-based Tucker decomposition and is 
comparable to PARAFAC model with the correct number of components. Moreover, 
we have demonstrated a simple heuristic to obtain a PARAFAC-style decomposition 
from the decomposition obtained by the proposed method. Moreover, we have shown 
that the proposed heuristic can reliably recover the true factors even when the number 
of PARAFAC factors is misspecified. 

The proposed approaches can be extended in many ways. For example, it would 
be important to handle non-Gaussian noise model [12, 20]; for example a tensor 
version of robust PCA [9] would be highly desirable. For classification over tensors, 
extension of the approach in [40] would be meaningful in applications including brain- 
computer interface; see also [35] for another recent approach. It is also important to 
extend the proposed approach to handle large scales tensors that cannot be kept in 
the RAM. Combination of the first-order optimization proposed by Acar et al. [1] 
with our approach is a promising direction. Moreover, in order to understand the 
threshold behaviour, further theoretical analysis is necessary. 

Appendix A. Computation of the dual objectives. In this Appendix, we 
show how we compute the dual objective values for the computation of the relative 
duality gap (4.8). 

A.l. Computation of dual objective for the "As a Matrix" approach. 

The dual problem of the constrained minimization problem (4.9) can be written as 
follows: 

maximize - — \\ttPk T a\\ 2 + y T flP k T a (A.l) 

aem N 2 11 11 

subject to ftP k T a = 0, ||A|| < 1. (A.2) 

Here Cl : M. N — > W N ~ M is the linear operator that reshapes the elements of a given 
N dimensional vector that correspond to the unobserved entries into an N — M 
dimensional vector. In addition, A £ K™*= xn \fc is the matricization of a and || • || is 
the spectral norm (maximum singular value). 

Note that the multiplier vector a* obtained through ADMM does not satisfy 
the above two constraints (A.2). Therefore, similar to the approach used in [45, 41], 
we apply the following transformations. First, we compute the projection a by 
projecting ct* to the equality constraint. This can be done easily by setting the 
elements of a* that correspond to unobserved entries to zero. Second, we compute 
the maximum singular value <7i of the matricization of a* and shrink a as follows: 

a 1 = min(l, l/eri)a*. 

Clearly this operation does not violate with the equality constraint. Finally we sub- 
stitute a* into the dual objective (A.l) to compute the relative duality gap as in 
Equation (4.8). 
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A. 2. Computation of dual objective for the "Constraint" approach. 

The dual problem of the constrained minimization problem (3.3) can be written as 
follows: 



. . A 
maximize 



n £ P k T a k 

k=l 



+ y T fi£P fe T a fe , (A.3) 
fe=i 



K 

subject to SI Pk T a k = 0, ||A fe || < 7 fe (k = 1, . . . , K). 

k=i 

Here the anti-observation operator is defined as in the last subsection, and A k S 
is the matricization of a k (k = l,...,K). 
In order to obtain a dual feasible point from the current multiplier vectors a£ 
(k = 1, . . . , X ), we apply similar transformations as in the last subsection. First, we 
compute the projection to the equality constraint. This can be done by computing 
the sum over a\, . . . , ot K for each unobserved entry. Then the sum divided by K is 
subtracted from each corresponding entry for k = 1, . . . , K. Let us denote by a k the 
multiplier vectors after the projection. Next, we compute the largest singular-values 
Cfc.i = <Ji(A k ) where A k is the matricization of the projected multiplier vector ctj, 
for k = 1,...,K. Now in order to enforce the inequality constraints, we define the 
shrinkage factor c as follows: 

c = min(l, 71/0-1,1, 72/0-2,1, • • -,7k/ctk,i)- (A.4) 

Using the above shrinkage factor, we obtain a dual feasible point a k as follows: 

ai=ca{ (k = l,...,K). 

Finally, we substitute a k into the dual objective (A.3) to compute the relative duality 
gap as in Equation (4.8). 

A.3. Computation of dual objective for the "Mixture" approach. The 

dual problem of the mixture formulation is already given in Equation (4.17). Making 
the implicit inequality constraints explicit, we can rewrite this as follows: 

maximize — — ||a|| 2 + ay, 

aER M 2 

subject to ||P fe f2 T o:|| < 7fe (k = l,...,K). 

Note that the norm in the second line should be interpreted as the spectral norm 
of the matricization of P k fl J a. Although the ADMM presented in Section 4.5 was 
designed to solve this dual formulation, we did not discuss how to evaluate the dual 
objective. Again the dual vector a* obtained through the ADMM does not satisfy 
the inequality constraints. 

In order to obtain a dual feasible point, we compute the largest singular-values 
Cfc,l = o-i (P k n T a) for k = 1, . . . , K. From the singular- values a k .i, we can compute 
the shrinkage factor c as in Equation (A.4) in the previous subsection. Finally, a dual 
feasible point can be obtained as a — cot, which we use for the computation of the 
relative duality gap (4.8). 
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